Abstract

In this paper, the problem of fault diagnosis in multiprocessor sys- 
tems is considered under a probabilistic fault model. This work focuses 
on minimizing the number of tests that must be conducted in order 
to correctly diagnose the state of every processor in the system with 
high probability. A diagnosis algorithm that can correctly diagnose the 
state of every processor with probability approaching one in a class of 
systems performing slightly greater than a linear number of tests is pre- 
sented. A nearly matching lower bound on the number of tests required 
to achieve correct diagnosis in arbitrary systems is also proven. Lower 
and upper bounds on the number of tests required for regular systems 
are also presented. A class of regular systems which includes hypercubes 
is shown to be correctly diagnosable with high probability. In all cases, 
the number of tests required under this probabilistic model is shown to 
be significantly less than under a bounded-size fault set model. Because 
the number of tests that must be conducted is a measure of the diag- 
nosis overhead, these results represent a dramatic improvement in the 
performance of system-level diagnosis techniques. 


Index Terms: Algorithms, fault diagnosis, hypercube, multiprocessor systems, 
permanent faults, probabilistic models. 
1 Introduction


Highly parallel computer systems, i.e. computer systems containing a large number 
of distinct processing elements, are being utilized in a growing number of applications.
For systems with a large number of processors, automatic fault diagnosis is an
attractive method of reducing maintenance costs as well as increasing system avail- 
ability. Previous work on multiprocessor system fault diagnosis has been primarily 
concerned with worst-case fault scenarios, leading to overly pessimistic assessments 
of diagnostic capability. The work presented in this paper focuses on evaluation 
of diagnosis strategies under a probabilistic model in which processors are faulty 
with independent and identical probabilities. This approach yields a more realistic 
assessment of diagnostic capability but at the same time increases the complexity 
of the corresponding analysis. 

The problem of multiprocessor system diagnosis has been addressed previously 
from a probabilistic viewpoint in [3,4,6,7,13,15,17,18,19,20]. The first paper concerning
probabilistic diagnosis [13] examined heterogeneous systems in which each
processor has an associated probability of failure. The authors examined the class of 
systems known as p-probabilistically diagnosable systems in which any set of faulty 
processors that has a priori probability greater than or equal to p of occurring is 
uniquely diagnosable. The problem of determining whether a given system is
p-probabilistically diagnosable has been shown to be co-NP-complete [20], while an
$O(n^3)$ algorithm has been given [6] for determining the most likely fault set of a
system in the closely related weighted model. In related work, Blount presented a 
method of achieving optimal diagnosis (diagnosis which is correct with maximum 
probability) in a general probabilistic model [4]. Unfortunately, this optimal diagno- 
sis requires exponential time and it was not determined how the quality of diagnosis 
varies with the number of tests conducted. 

In p-probabilistically diagnosable systems, fault sets with probability of occur- 
rence slightly less than p can exist. Hence, the most likely fault set may be only 
slightly more probable than other fault sets, meaning that the probability of choos- 
ing an incorrect fault set may be high. The same may be true even when optimal 
diagnosis can be achieved. In [18], the author examined systems for which the cor- 
rect fault set can be identified with high probability. The model utilized applies to 
homogeneous systems in which each processor has a common probability of failure 
p. An efficient diagnosis algorithm was presented that correctly diagnoses a class of 
systems containing cn log n tests, for c above a threshold specified in [18], with probability approaching one.

It was also claimed in [18] that this result was the best possible, i.e. all algorithms 
must have probability approaching zero of achieving correct diagnosis in systems 
containing o(n log n) tests. Unfortunately, due to a subtle flaw in the proof, this 
result is untrue. This result was also used in [3] to prove a similarly flawed lower
bound in a more general probabilistic model. 

In this paper, we utilize the model presented in [18]. A counterexample to the 
lower bound claimed in [18] is given in which correct diagnosis is achieved with 
constant probability in a sequence of digraphs with n - 1 tests. Next, a diagnosis
algorithm that produces correct diagnosis with probability approaching one in di- 
graphs containing slightly more than a linear number of tests is given. A nearly 
matching lower bound on the number of tests required to achieve correct diagnosis 
with probability approaching one is then proven. Finally, the problem of diagnosis 
in regular systems is considered. A class of systems conducting $\Theta(n \log n)$ tests in
which correct diagnosis can be achieved with probability approaching one is pre- 
sented. This class contains the systems given in [18] as well as the important class 
of hypercubes. It is also shown that for regular systems possessing o(n log n) tests, 
all diagnosis algorithms perform poorly. This final result implies that for the
important class of fixed-degree regular systems, weaker forms of diagnosis must be
considered.


2 Preliminaries 

The multiprocessor system model utilized in this paper was proposed in [16]. In 
this model a system is represented as a directed graph with vertices of the digraph 
representing processors in the system and edges of the digraph representing tests 
performed by one processor on another processor. In this section, all basic quantities 
related to this model are defined and a measure of diagnosis algorithm performance 
is presented. 

2.1 Basic Definitions 

For a system composed of n processors, the set of processors is represented by 
$U = \{u_1, \ldots, u_n\}$. It is assumed that these processors are capable of performing
tests on one another. This situation is represented by a digraph $G(U, E)$, where the
vertex set $U$ corresponds to the set of processors of the system and $(u, v) \in E$ if and
only if processor $u$ tests processor $v$ in the system. Associated with each $(u, v) \in E$
is a test outcome. This outcome is 1 if $u$ evaluates $v$ as faulty and 0 if $u$ evaluates $v$ as fault-free.
A complete collection of test outcomes constitutes a syndrome. Syndromes,
fault sets, and other fundamental concepts are defined below.

Definition 1 For a digraph G(U,E), a syndrome is a function from E to {0,1}. 
Definition 2 For a digraph G(U,E), a fault set is a subset of the vertex set U.

For a processor u, the tester set consists of the processors that test u, the fail- 
ure set consists of the processors that fail u, and the neighbor set consists of the 
processors that test u along with those that u tests. These quantities are defined 
below. 

Definition 3 For a digraph $G(U, E)$ and $u \in U$, the tester set of $u$, denoted by
$\Gamma^{-1}(u)$, is given by

$$\Gamma^{-1}(u) = \{v \in U : (v, u) \in E\}$$


Definition 4 For a digraph $G(U, E)$, a syndrome $S$, and $u \in U$, the failure set of
$u$, denoted by $\mathrm{fail}_{in}(u)$, is given by

$$\mathrm{fail}_{in}(u) = \{v \in \Gamma^{-1}(u) : S((v, u)) = 1\}$$


Definition 5 For a digraph $G(U, E)$ and $u \in U$, the neighbor set of $u$, denoted by
$N(u)$, is given by

$$N(u) = \{v \in U : (u, v) \in E \text{ or } (v, u) \in E\}$$

2.2 Diagnosis Algorithm Evaluation 

A fundamental problem in multiprocessor systems is to identify the faulty proces- 
sors in a system given a syndrome. An algorithm for this problem is referred to as a 
diagnosis algorithm. A diagnosis algorithm takes a syndrome as input and outputs 
a subset of the processors in the system. This subset contains exactly the processors 
diagnosed as faulty by the algorithm. Thus, for a set of faulty processors and a syn- 
drome it is possible to evaluate if the output of a deterministic algorithm is correct 
by comparing the algorithm’s output with the set of faulty processors. Syndrome, 
fault set pairs are therefore used as the basic element in the subsequent probabilistic 
analysis of diagnosis algorithm performance. Before proceeding with this analysis 
however, the notion of correct diagnosis must be defined. For a syndrome $S$ from a
digraph $G(U, E)$, and a deterministic algorithm $A$, let

$$\mathrm{Faulty}_A(S) = \{u \in U : \text{Algorithm } A \text{ diagnoses } u \text{ as faulty when run on } S\}$$

Thus, $\mathrm{Faulty}_A(S)$ represents the output of Algorithm $A$ when run on syndrome $S$.
Using this, the diagnosis quality of an algorithm on a syndrome, fault set pair is 
characterized in Definition 6. 
Definition 6 For a syndrome, fault set pair $(S, F)$ from a digraph $G(U, E)$, a deterministic
algorithm $A$ is said to produce

correct diagnosis if and only if $\mathrm{Faulty}_A(S) = F$,
partial diagnosis if and only if $\mathrm{Faulty}_A(S) \subset F$, and
false alarm diagnosis if and only if $\mathrm{Faulty}_A(S) \not\subset F$.
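
The three outcomes of Definition 6 can be sketched as a small classifier. This is an illustrative reading, not part of the original text: the function name `classify_diagnosis` is hypothetical, and the sketch assumes that "partial" is reported only when the diagnosed set is a proper subset of the fault set, so that the three labels are mutually exclusive.

```python
# Illustrative sketch only; the function name and label strings are hypothetical.

def classify_diagnosis(diagnosed, fault_set):
    """Classify an algorithm's output against the true fault set."""
    diagnosed, fault_set = set(diagnosed), set(fault_set)
    if diagnosed == fault_set:
        return "correct"
    if diagnosed < fault_set:      # proper subset: some faults were missed
        return "partial"
    return "false alarm"           # some fault-free processor was accused

print(classify_diagnosis({1, 2}, {1, 2}))
print(classify_diagnosis({1}, {1, 2}))
print(classify_diagnosis({1, 3}, {1, 2}))
```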

Note that Definition 6 differs from that used in some previous work, e.g. [21], where
correct diagnosis may allow faulty processors to be identified as fault-free so long 
as no fault-free processor is identified as faulty. In Definition 6, diagnosis is correct 
only when each fault-free processor is identified as fault-free and each faulty pro- 
cessor is identified as faulty. One of the goals of this paper is to provide a rigorous 
foundation for the analysis of the diagnosis problem. To achieve this goal we take 
great care in defining a proper measure of diagnosis algorithm performance as well 
as a probabilistic fault model under which this performance can be evaluated. This 
probabilistic model is presented in the following section. 


3 Probabilistic Model 

In much of the previous work in the system-level diagnosis area, diagnosis algorithm 
evaluation has focused on worst-case performance. Under a bounded-size fault set 
model, correct diagnosis can be guaranteed if the number of faulty processors in the 
system is no greater than some value $t < n/2$. Such a model allows any set of $t$ or
fewer processors in a system to be faulty, including sets that may be extremely rare in 
practice. This approach can therefore lead to an overly pessimistic view of diagnosis 
algorithm performance. In this paper, we present a probabilistic model for the faults 
in a system and we use, as a measure of performance, the probability that a diagnosis 
algorithm correctly identifies the faulty processors in the system. This approach 
yields a more realistic assessment of diagnosis algorithm performance by accounting 
for the likelihood of occurrence of the fault sets in a system. In our probabilistic 
fault model, processors are faulty with probability p independently of one another, 
fault-free processors always produce the correct outcome when performing a test, 
and no assumptions are made concerning the outcomes of tests performed by faulty 
processors. It will be shown in this paper that in contrast to the bounded-size fault 
set model, correct diagnosis can be achieved with high probability in this model at 
relatively low cost. 

Some comments concerning the behavior of faulty processors under this model 
are in order. We make no assumptions concerning the outcomes of tests performed 
by faulty processors. Thus, faulty processors can pass or fail other processors in 
virtually any manner. For example, faulty processors can: 
1. always fail other processors,

2. always pass other processors, 

3. fail other processors with some probability, 

4. collaborate with other faulty processors through their test outcomes in an 
attempt to confuse the diagnosis algorithm, or 

5. combine any or all of the above behaviors. 

Since these as well as any other behaviors are allowed under this model, this is 
equivalent to assuming that the faulty processors produce test outcomes in the 
most detrimental manner. The diagnosis algorithms we present in this paper are 
shown to produce correct diagnosis with high probability under any of these faulty 
processor behaviors and are therefore very robust. We also show that the set of 
systems for which these algorithms work contains systems which are very nearly the 
sparsest possible under this model and hence, significant improvements can only be 
achieved by restricting the behavior of faulty processors. With this in mind, we now 
present the probability model. 

For a digraph $G(U, E)$, the sample space $\Omega_{G(U,E)}$ of the probability model consists
of all syndrome, fault set pairs in that digraph, i.e.

$$\Omega_{G(U,E)} = \{(S, F) : F \subseteq U \text{ and } S \text{ is a function from } E \text{ to } \{0, 1\}\}.$$

Since no assumptions are made concerning the outcomes of tests performed by faulty 
processors, the probability of a particular syndrome given a fault set may not be 
specified in this model. The basic events of the model consist of sets of syndrome, 
fault set pairs which have the same fault set and whose syndromes are identical 
except for the labels on edges out of faulty processors. Formally, a syndrome, fault 
set pair (S', F') is contained in a basic event B defined as follows: 

$$B = \{(S, F) : F = F' \text{ and } \forall (u, v) \in E \text{ with } u \in U - F,\ S'((u, v)) = S((u, v))\}$$

Note that there is a unique fault set associated with each basic event but that each 
event may contain many distinct syndrome, fault set pairs. Now, let 

$$\mathcal{B}_{G(U,E)} = \{B : B \text{ is a basic event of } G(U, E)\}.$$

The family of events $\mathcal{F}_{G(U,E)}$ in this probability space is the set of all subsets of
$\mathcal{B}_{G(U,E)}$.

Definition 7 A syndrome, fault set pair $(S, F)$ in a digraph $G(U, E)$ is said to be
incompatible if and only if $\exists u, v \in U$ such that $u \in U - F$, $(u, v) \in E$, and

1. $v \in U - F$ and $S((u, v)) = 1$, or

2. $v \in F$ and $S((u, v)) = 0$.


A syndrome, fault set pair which is not incompatible is said to be compatible. A basic
event is said to be incompatible if its syndrome, fault set pairs are incompatible, 
otherwise it is compatible. The probability of a basic event B in a digraph G(U , E) 
is defined as follows: 


$$P_G(B) = \begin{cases} 0 & \text{if } B \text{ is incompatible} \\ p^{|F|}(1-p)^{n-|F|} & \text{otherwise} \end{cases}$$

where $F$ represents the unique fault set associated with $B$. Clearly,

$$\sum_{B \in \mathcal{B}_{G(U,E)}} P_G(B) = 1$$


and, hence, this is a legitimate probability measure. 
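
The normalization can be checked numerically. Because the outcomes of tests performed by fault-free processors are forced by the fault set, each fault set has exactly one compatible basic event, so the sum reduces to a sum over all $2^n$ fault sets. The sketch below illustrates this under that observation; the function names are hypothetical and the tiny two-processor digraph is an arbitrary example.

```python
# Illustrative sketch only; names and example values are hypothetical.
from itertools import combinations

def is_compatible(edges, syndrome, fault_set):
    """Definition 7: every fault-free tester must report the true state."""
    for (u, v) in edges:
        if u in fault_set:
            continue                       # faulty testers are unconstrained
        expected = 1 if v in fault_set else 0
        if syndrome[(u, v)] != expected:
            return False
    return True

def fault_set_probability(fault_set, n, p):
    """Mass of the compatible basic event associated with fault_set."""
    k = len(fault_set)
    return p ** k * (1 - p) ** (n - k)

def total_probability(n, p):
    """Sum of the measure over all 2^n fault sets (one compatible event each)."""
    return sum(
        fault_set_probability(combo, n, p)
        for k in range(n + 1)
        for combo in combinations(range(n), k)
    )

edges = {(0, 1), (1, 0)}
# Processor 1 is faulty, so its own test outcome may be anything.
print(is_compatible(edges, {(0, 1): 1, (1, 0): 0}, {1}))
print(total_probability(5, 0.1))
```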

The primary measure of the performance of a diagnosis algorithm used in this 
paper is the probability that the algorithm produces correct diagnosis. For a digraph 
$G(U, E)$ and a deterministic algorithm $A$, let

$$\mathrm{Correct}_G(A) = \{(S, F) : \mathrm{Faulty}_A(S) = F\}$$

and let $\mathrm{NotCorrect}_G(A)$ represent the complement of $\mathrm{Correct}_G(A)$. Thus,
$\mathrm{Correct}_G(A)$ represents the set of all syndrome, fault set pairs in a digraph for
which Algorithm $A$ produces correct diagnosis. Note that it may be the case that
$\mathrm{Correct}_G(A) \notin \mathcal{F}_{G(U,E)}$, in which case $P_G(\mathrm{Correct}_G(A))$ will not be defined. The
output of a particular diagnosis algorithm may depend on the outcomes of tests 
performed by faulty processors and thus, the probability of correct diagnosis for the 
algorithm cannot be determined until a probability distribution on these edges is 
specified. 

For a digraph $G(U, E)$, let $P'_G$ be a probability function defined on $\Omega_{G(U,E)}$
such that the family of events is equal to all subsets of $\Omega_{G(U,E)}$ and $\forall B \in \mathcal{B}_{G(U,E)}$,
$P'_G(B) = P_G(B)$. Such a probability function will be referred to as a refinement of
$P_G$. Now, let $\mathcal{P}_G$ represent the set of all refinements of $P_G$. Since any type of behavior
of the faulty processors is allowed in this model, the probability of correct diagnosis
for a deterministic algorithm $A$ in a digraph $G(U, E)$, denoted by $\mathrm{DiagProb}_G(A)$, is
defined to be

$$\mathrm{DiagProb}_G(A) = \min_{P'_G \in \mathcal{P}_G} P'_G(\mathrm{Correct}_G(A)) = \min_{P'_G \in \mathcal{P}_G} \sum_{(S,F) \in \mathrm{Correct}_G(A)} P'_G((S, F))$$
Thus, when calculating the probability of correct diagnosis for an algorithm it is assumed
that the faulty processors perform their tests in the manner most detrimental
to the algorithm. We may also define this diagnosis probability for probabilistic diagnosis
algorithms. Given a syndrome $S$, a probabilistic diagnosis algorithm $A$ chooses
a fault set $F$ with some probability, call it $p_{A,S}(F)$, where $\sum_{F \subseteq U} p_{A,S}(F) = 1$. Thus,
for a digraph $G(U, E)$ and a probabilistic diagnosis algorithm $A$, the probability of
correct diagnosis for Algorithm $A$ is defined to be

$$\mathrm{DiagProb}_G(A) = \min_{P'_G \in \mathcal{P}_G} \sum_{(S,F) \in \Omega_{G(U,E)}} P'_G((S, F)) \cdot p_{A,S}(F)$$

4 Diagnosis Using n-1 Tests 

In [18], an efficient diagnosis algorithm that achieves correct diagnosis with probability
approaching one in sequences of digraphs containing cn log n edges, for c above a
threshold specified in [18], was presented. It was also claimed in [18] that all diagnosis algorithms must have
a probability of correct diagnosis that approaches zero for digraphs containing 
o(n log n) edges. In this section, a sequence of digraphs containing n-1 edges 
is exhibited for which a simple diagnosis algorithm can achieve correct diagnosis 
with constant probability, thereby providing a counter-example to this claim. 

Consider a sequence of digraphs $G_n(U_n, E_n)$ with $U_n = \{u_1, \ldots, u_n\}$ and $E_n$
defined as follows:

$$E_n = \{(u_1, u_2), (u_1, u_3), \ldots, (u_1, u_{n-1}), (u_1, u_n)\},$$

i.e. $u_1$ tests all other processors. Now, consider the following simple diagnosis
algorithm. 

Algorithm Naive

Input: A syndrome $S$ in a digraph $G(U, E)$.

Output: A set $F \subseteq U$.

$F \leftarrow \emptyset$

for each $v \in \{u_2, u_3, \ldots, u_n\}$

    if $S((u_1, v)) = 1$ then $F \leftarrow F \cup \{v\}$

Algorithm Naive simply assumes that $u_1$ is fault-free and diagnoses a processor
as faulty if and only if it is failed by $u_1$. Clearly, if $u_1$ is faulty, Algorithm Naive
incorrectly diagnoses $u_1$ itself. If $u_1$ is fault-free, however, Algorithm Naive produces
correct diagnosis. Thus, $\forall P'_{G_n} \in \mathcal{P}_{G_n}$,

$$P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Naive})) = P'_{G_n}(\{(S, F) : u_1 \text{ is fault-free}\}) = 1 - p$$

and therefore

$$\mathrm{DiagProb}_{G_n}(\mathrm{Naive}) = 1 - p.$$

Thus, this simple diagnosis algorithm produces correct diagnosis with constant prob- 
ability in a sequence of digraphs containing exactly n - 1 edges. 
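
The value $1 - p$ can be reproduced by brute force on a small instance. The sketch below is an illustration under one stated assumption: since the minimum over refinements lets the faulty processors place all of a basic event's mass on the worst syndrome, a fault set is counted only if the algorithm is correct for every possible outcome of the tests performed by faulty processors. Processor 0 plays the role of $u_1$; all function names are hypothetical.

```python
# Illustrative sketch only; names are hypothetical and processor 0 stands for u_1.
from itertools import combinations, product

def naive(syndrome, n):
    """Algorithm Naive on the star digraph: believe every outcome reported by 0."""
    return frozenset(v for v in range(1, n) if syndrome[(0, v)] == 1)

def diag_prob_naive(n, p):
    """Worst-case probability of correct diagnosis, by exhaustive enumeration."""
    edges = [(0, v) for v in range(1, n)]
    total = 0.0
    for k in range(n + 1):
        for combo in combinations(range(n), k):
            fault_set = frozenset(combo)
            # Outcomes of fault-free testers are forced; faulty testers are free.
            forced = {e: int(e[1] in fault_set) for e in edges if e[0] not in fault_set}
            free = [e for e in edges if e[0] in fault_set]
            always_correct = True
            for bits in product((0, 1), repeat=len(free)):
                syndrome = dict(forced)
                syndrome.update(zip(free, bits))
                if naive(syndrome, n) != fault_set:
                    always_correct = False
                    break
            if always_correct:
                total += p ** k * (1 - p) ** (n - k)
    return total

print(diag_prob_naive(4, 0.2))  # expected to agree with 1 - p up to rounding
```

The enumeration is exponential in n and is meant only as a sanity check on very small systems.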


5 A Majority-Vote Algorithm 

In this section, a simple yet powerful diagnosis algorithm known as Algorithm Ma- 
jority is presented. In Algorithm Majority a processor is diagnosed as faulty if and 
only if it is failed by more than 1/2 the processors in its tester set. 

Algorithm Majority

Input: A syndrome $S$ in a digraph $G(U, E)$.

Output: A set $F \subseteq U$.

$F \leftarrow \emptyset$

for each $u \in U$

    if $|\mathrm{fail}_{in}(u)| > |\Gamma^{-1}(u)|/2$ then $F \leftarrow F \cup \{u\}$


Theorem 1 For a digraph $G(U, E)$, Algorithm Majority has a time complexity of
$O(|E|)$ and a space complexity of $O(|E|)$.

Proof: The failure set cardinalities as well as the tester set cardinalities can be
calculated in a single traversal of the labeled adjacency lists of the digraph. This
requires $O(|E|)$ time. The only storage requirement for the algorithm aside from
the input and output is a set of temporary variables to hold these values as they are
calculated. Hence, the space complexity is also $O(|E|)$. □

Algorithm Majority is slightly more sophisticated than Algorithm Naive. Rather 
than blindly believing the test outcomes of a single processor, it relies on a majority- 
vote among the processors in the tester set of a given processor. It should be noted 
that for the special class of systems in which one processor tests every other processor 
and no other tests are conducted, Algorithms Naive and Majority are equivalent. 
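
A minimal Python sketch of Algorithm Majority follows. It makes a single pass over the edges, in line with the traversal described in the proof of Theorem 1. The function name, the edge-list representation, and the five-processor example are assumptions made for illustration only.

```python
# Illustrative sketch only; the representation and example are hypothetical.

def majority(edges, syndrome, processors):
    """Diagnose u as faulty iff more than half of its testers fail it."""
    testers = {u: 0 for u in processors}
    fails = {u: 0 for u in processors}
    for (v, u) in edges:                 # single pass over the edges
        testers[u] += 1
        fails[u] += syndrome[(v, u)]
    # Strict majority; a processor with no testers is never diagnosed as faulty.
    return {u for u in processors if 2 * fails[u] > testers[u]}

# Example: processors 0, 1, 2 test everyone else; processors 1 and 4 are faulty.
processors = range(5)
fault_set = {1, 4}
edges = [(t, v) for t in (0, 1, 2) for v in processors if v != t]
syndrome = {}
for (t, v) in edges:
    if t in fault_set:
        syndrome[(t, v)] = 1             # this faulty tester fails everyone
    else:
        syndrome[(t, v)] = int(v in fault_set)

print(majority(edges, syndrome, processors))
```

Comparing `2 * fails[u]` with `testers[u]` keeps the test in integer arithmetic and avoids any rounding question when the tester set has odd cardinality.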
6 Diagnosis in Sparse Systems

In this section, we examine the problem of correctly diagnosing multiprocessor sys- 
tems having sparse communication networks. First, it is shown that for a class of 
irregularly structured systems utilizing a number of tests growing just faster than n, 
Algorithm Majority correctly diagnoses every processor with probability approach- 
ing one. Next, the probability of correct diagnosis of Algorithm Majority is evaluated 
on some fixed systems which utilize a modest number of tests. Finally, it is proven 
that a linear number of tests are required for any diagnosis algorithm to be capable 
of producing correct diagnosis with high probability. 

6.1 An Upper Bound on the Number of Tests Necessary for Cor- 
rect Diagnosis 

Consider a class of systems in which there is a set of processors known as the testers. 
The systems are such that any processor which is a tester tests all other processors in 
the system (including the other testers). Any processor that is not a tester conducts 
no tests. Thus, a (small) fraction of the processors are relied upon to satisfy all the 
testing requirements of the system. Such a digraph will be referred to as a tester 
digraph, formally defined below. 

Definition 8 A digraph $G(U, E)$ is said to be a tester digraph if and only if
$\exists T_G \subseteq U$ such that

$$E = \{(u, v) : u \in T_G, v \in U, \text{ and } u \ne v\}.$$

The set $T_G$ is known as the testing set of $G$.

Figure 1 is an example of a tester digraph with 3 testers and 8 vertices. 

For a tester digraph $G(U, E)$ with testing set $T_G$, let

$$\mathrm{GoodMaj}_G = \{(S, F) : |T_G \cap (U - F)| > \tfrac{|T_G|}{2} \text{ and } (S, F) \text{ is compatible}\}$$

Thus, $\mathrm{GoodMaj}_G$ represents the set of compatible syndrome, fault set pairs in which
more than 1/2 the testers are fault-free. The following lemma shows that if the
majority of testers in a tester digraph are fault-free, then the diagnosis of Algorithm 
Majority will be correct. 

Lemma 1 For a tester digraph $G(U, E)$, $\mathrm{GoodMaj}_G \subseteq \mathrm{Correct}_G(\mathrm{Majority})$.

Figure 1: A Tester Digraph


Proof: We will show that if $(S, F) \in \mathrm{GoodMaj}_G$, then $(S, F) \in \mathrm{Correct}_G(\mathrm{Majority})$
and therefore, $\mathrm{GoodMaj}_G \subseteq \mathrm{Correct}_G(\mathrm{Majority})$.

Consider any $(S, F) \in \mathrm{GoodMaj}_G$ and any $u \in U$.

case 1: $u \in (U - T_G) \cap (U - F)$

Because $(S, F)$ is compatible, $u$ must be passed by more than 1/2 the testers, implying
$u \notin \mathrm{Faulty}_{\mathrm{Majority}}(S)$. Recall that $\mathrm{Faulty}_{\mathrm{Majority}}(S)$ is the set of processors
diagnosed as faulty by Algorithm Majority when run on $S$.

case 2: $u \in (U - T_G) \cap F$

Similarly, $u$ must be failed by more than 1/2 the testers, implying $u \in \mathrm{Faulty}_{\mathrm{Majority}}(S)$.

case 3: $u \in T_G \cap (U - F)$

Here, $u$ can be failed by at most 1/2 the remaining testers. Since Algorithm Majority
diagnoses a unit as faulty only when it is failed by a strict majority of its tester set,
$u \notin \mathrm{Faulty}_{\mathrm{Majority}}(S)$.

case 4: $u \in T_G \cap F$

In this case, $u$ must be failed by more than 1/2 the remaining testers, implying
$u \in \mathrm{Faulty}_{\mathrm{Majority}}(S)$.

Hence, $\mathrm{Faulty}_{\mathrm{Majority}}(S) = F$ and therefore $(S, F) \in \mathrm{Correct}_G(\mathrm{Majority})$. □
Thus, if more than 1/2 the testers in a tester digraph are fault-free, Algorithm
Majority produces correct diagnosis. Theorem 2 shows that if the number of testers 
is given by any unbounded function, this condition will be achieved with probabil- 
ity approaching one and hence the probability of correct diagnosis for Algorithm 
Majority approaches one. 
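
Lemma 1 can be checked exhaustively on small tester digraphs. The sketch below enumerates every fault set in which a strict majority of the testers is fault-free, together with every possible behavior of the faulty testers, and confirms that a majority vote returns exactly the fault set. The function names and the choice of small instances are assumptions made for illustration.

```python
# Illustrative sketch only; names and the small instances are hypothetical.
from itertools import combinations, product

def majority(edges, syndrome, n):
    """Majority vote: u is faulty iff more than half of its testers fail it."""
    testers = [0] * n
    fails = [0] * n
    for (v, u) in edges:
        testers[u] += 1
        fails[u] += syndrome[(v, u)]
    return {u for u in range(n) if 2 * fails[u] > testers[u]}

def check_lemma1(n, testing_set):
    """True if Majority is correct on every pair with a fault-free tester majority."""
    edges = [(t, v) for t in testing_set for v in range(n) if v != t]
    for k in range(n + 1):
        for combo in combinations(range(n), k):
            fault_set = set(combo)
            good = [t for t in testing_set if t not in fault_set]
            if 2 * len(good) <= len(testing_set):
                continue                  # no strict fault-free majority
            forced = {e: int(e[1] in fault_set) for e in edges if e[0] not in fault_set}
            free = [e for e in edges if e[0] in fault_set]
            for bits in product((0, 1), repeat=len(free)):
                syndrome = dict(forced)
                syndrome.update(zip(free, bits))
                if majority(edges, syndrome, n) != fault_set:
                    return False
    return True

print(check_lemma1(5, [0, 1, 2]))
print(check_lemma1(6, [0, 1, 2, 3]))
```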


Theorem 2 Let $\omega(n)$ be any unbounded function. If $p < 1/2$, then for any sequence
of tester digraphs on $n$ vertices having $\omega(n)$ testers, the probability of correct
diagnosis for Algorithm Majority approaches one as $n \to \infty$.


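Proof: We must show that for any sequence satisfying the theorem condition,
$\mathrm{DiagProb}_{G_n}(\mathrm{Majority}) \to 1$ as $n \to \infty$. If we let $X$ be a random variable representing
the number of faulty units in the testing set of a tester digraph $G$, then

$$\mathrm{GoodMaj}_G = \{(S, F) : X < \tfrac{|T_G|}{2} \text{ and } (S, F) \text{ is compatible}\}$$

Now, $X$ is a binomial random variable with parameters $|T_G|$ and $p$. It follows from
Lemma 1 that $\forall P'_{G_n} \in \mathcal{P}_{G_n}$,

$$P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \ge P'_{G_n}(\mathrm{GoodMaj}_{G_n}) = \Pr\left[X < \tfrac{|T_{G_n}|}{2}\right]$$

where the equality holds because incompatible pairs have probability zero under every refinement.
Now, since $p < 1/2$, $\tfrac{1}{2} - p > 0$, and by the Weak Law of Large Numbers [9],

$$P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \to 1$$

and therefore

$$\mathrm{DiagProb}_{G_n}(\mathrm{Majority}) \to 1.$$

□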


Thus, Algorithm Majority produces correct diagnosis with probability approaching
one in a class of digraphs containing a number of edges given by $n \cdot \omega(n)$, where
$\omega(n)$ is any function that goes to infinity (arbitrarily slowly) with $n$. Under a
bounded-size fault set model a quadratic number of tests are required to withstand
a linear number of faults, while this result shows that in this probabilistic model a
linear expected number of faults can be tolerated with a number of tests that is arbitrarily
close to linear. The maximum degree of the vertices in this class of digraphs
is large, however, which may be a problem in some applications. This motivates us
to study the problem of diagnosis in sparse regular systems in Section 7.

    p        |T_G|
    0.001       3
    0.005       5
    0.010       5
    0.050      11
    0.100      19
    0.200      41
    0.300     105

Table 1: Size of Testing Set Required for
Correct Diagnosis Probability of 0.99999


6.2 Performance of Algorithm Majority on Fixed Systems 

In this section, the number of tests required to achieve a given probability of correct 
diagnosis in tester digraphs using Algorithm Majority is examined. For a tester 
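digraph $G(U, E)$ with testing set $T_G$,

$$\mathrm{DiagProb}_G(\mathrm{Majority}) \ge \sum_{i=0}^{\lfloor (|T_G| - 1)/2 \rfloor} \binom{|T_G|}{i} p^i (1-p)^{|T_G| - i} \qquad (1)$$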


Note that the probability of correct diagnosis depends only on the testing set cardi- 
nality and not on n. For a given probability of failure, Inequality 1 can be used to 
determine the number of testers needed for Algorithm Majority to achieve a speci- 
fied probability of correct diagnosis. The size of the testing set required to achieve a 
correct diagnosis probability of 0.99999 for various values of p is shown in Table 1. If 
the probability of failure of a processor is 0.001, Algorithm Majority can achieve cor- 
rect diagnosis with a probability of 0.99999 using three tests per processor regardless 
of the number of processors in the system. For a probability of failure of 0.005 or 
0.010 the testing set need only be of cardinality five for Algorithm Majority to achieve
a probability of correct diagnosis of 0.99999. Thus, when the probability of failure
is small, correct diagnosis can be achieved with extremely high probability using a
total number of tests that is near n. When p is larger, more tests are necessary. As
indicated in Table 1, for a probability of failure of 0.300, more than 100 tests per
processor are required to achieve correct diagnosis with probability 0.99999. Since
a large fraction of the processors in the system will be faulty in this situation it is
to be expected that a larger number of tests are required. The important point is
that the total number of tests remains proportional to n regardless of the value of
p.

    n         p       Bounded-size    Probabilistic
    100       0.01             400               99
    100       0.10           1,800              495
    100       0.30           4,100            3,069
    1,000     0.01          18,000              999
    1,000     0.10         123,000            4,995
    1,000     0.30         334,000           30,969
    10,000    0.01       1,240,000            9,999
    10,000    0.10      10,700,000           49,995
    10,000    0.30      31,070,000          309,969

Table 2: Total Number of Tests Necessary for
Correct Diagnosis Probability of 0.99
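
The use of Inequality 1 described above can be sketched in a few lines of Python. The sketch evaluates the probability that fewer than half of the testers are faulty and searches for the smallest testing set meeting a target; the function names, the search limit, and the decision to search over every cardinality (even as well as odd) are assumptions made for illustration. The total test count assumes each tester tests the other n - 1 processors, as in a tester digraph.

```python
# Illustrative sketch only; names and the search limit are hypothetical.
from math import comb

def majority_bound(num_testers, p):
    """Probability that strictly fewer than half of the testers are faulty."""
    upper = (num_testers - 1) // 2
    return sum(
        comb(num_testers, i) * p ** i * (1 - p) ** (num_testers - i)
        for i in range(upper + 1)
    )

def min_testers(p, target, limit=1000):
    """Smallest testing-set size whose bound reaches the target, or None."""
    for t in range(1, limit + 1):
        if majority_bound(t, p) >= target:
            return t
    return None

def total_tests(n, num_testers):
    """Each tester tests the other n - 1 processors."""
    return num_testers * (n - 1)

for p in (0.001, 0.005, 0.010):
    print(p, min_testers(p, 0.99999))
print(total_tests(100, min_testers(0.10, 0.99)))
```

Because the bound depends only on the testing-set size, the same search serves any system size; only `total_tests` involves n.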

In Table 2, we compare the number of tests required under the bounded-size 
fault set model to the number required by Algorithm Majority in order to achieve 
a correct diagnosis probability of 0.99. The number of tests required under the 
bounded-size fault set model was calculated in the following manner. For a given n 
and p, determine t such that the probability of more than t out of the n processors 
being faulty is no greater than 0.01. Table 2 shows the results of this comparison 
for various values of n and p. For large n and small p the number of tests required 
under the probabilistic model is dramatically lower than the number required under 
the bounded-size fault set model. For example, when n = 10,000 and p = 0.10, the 
number of tests required in the probabilistic model is reduced by a factor of 214 
over the bounded-size fault set model. 
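
The bounded-size calculation described above can be sketched as follows. The search for t follows the stated rule; converting t into a test count as n times t (that is, t tests per processor) is an assumption made here, chosen because it is consistent with the first row of Table 2. The binomial terms are evaluated in log space so that large n does not overflow; all names are hypothetical.

```python
# Illustrative sketch only; the n * t conversion is an assumption, not from the text.
from math import exp, lgamma, log

def binomial_pmf(n, k, p):
    """Pr[exactly k of n processors are faulty], computed in log space (0 < p < 1)."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

def smallest_t(n, p, tail=0.01):
    """Smallest t such that Pr[more than t of n processors faulty] <= tail."""
    cumulative = 0.0
    for t in range(n + 1):
        cumulative += binomial_pmf(n, t, p)
        if 1.0 - cumulative <= tail:
            return t
    return n

def bounded_size_tests(n, p, tail=0.01):
    """Assumed test count: t tests per processor, n processors."""
    return n * smallest_t(n, p, tail)

print(smallest_t(100, 0.01))
print(bounded_size_tests(100, 0.01))
```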
6.3 A Lower Bound on the Number of Tests Necessary for Correct
Diagnosis 

In this section, a lower bound on the number of tests necessary to achieve correct 
diagnosis with high probability is proven. It is shown that if the number of edges in
an arbitrary sequence of digraphs grows slower than n, then all diagnosis algorithms 
have probability approaching zero of achieving correct diagnosis. This result implies 
that Algorithm Majority achieves a probability approaching one of correct diagnosis 
on systems that are very nearly as sparse as possible. Thus, this relatively simple 
diagnosis algorithm is indeed extremely powerful. 

When the number of edges in a sequence of digraphs grows slower than n, isolated 
processors, i.e. processors which have no incident edges, must exist. Intuitively, no
diagnosis algorithm should be capable of correctly identifying the state of all these
isolated processors with high probability, making diagnosis in such situations impossible.
This is formally proven in Theorem 3. The essence of the proof of Theorem 3
can be explained quite simply. To prove that a deterministic diagnosis algorithm
$A$ has a probability approaching zero of achieving correct diagnosis in a sequence
of digraphs $G_n(U_n, E_n)$, a set of $(S, F)$ pairs disjoint from $\mathrm{Correct}_{G_n}(A)$ must be
exhibited that has a probability dominating the probability of $\mathrm{Correct}_{G_n}(A)$. For
a given syndrome from a system with isolated processors, it can be shown that so
long as the number of isolated processors approaches infinity, the probability of that
syndrome and a fault set with a particular labeling of the isolated processors is dominated
by the probability of that syndrome and the fault sets in which the isolated
processors are relabeled in all possible ways. Thus, for any $(S, F) \in \mathrm{Correct}_{G_n}(A)$,
a set of syndrome, fault set pairs disjoint from $\mathrm{Correct}_{G_n}(A)$ can be exhibited that
has probability dominating the probability of $(S, F)$. It is also shown that there
exists a deterministic diagnosis algorithm that has performance at least as good as 
the performance of any probabilistic algorithm, thus completing the proof. 


Theorem 3 Let A be any probabilistic or deterministic diagnosis algorithm. If 
$0 < p < 1$, then for any sequence of digraphs on $n$ vertices having $o(n)$ edges, the
probability of correct diagnosis for Algorithm $A$ approaches zero as $n \to \infty$.


Proof: We must show that for any probabilistic or deterministic diagnosis algorithm
$A$ and any sequence of digraphs $G_n(U_n, E_n)$ having $|E_n| \in o(n)$,
$\mathrm{DiagProb}_{G_n}(A) \to 0$ as $n \to \infty$. Assume faulty processors pass all processors they
test. This yields a refinement $P^*_{G_n} \in \mathcal{P}_{G_n}$, where

$$P^*_{G_n}((S, F)) = \begin{cases} 0 & \text{if } (S, F) \text{ is incompatible or } \exists u \in F, v \in U_n \text{ with } S((u, v)) = 1 \\ p^{|F|}(1-p)^{n-|F|} & \text{otherwise} \end{cases}$$
Now, let $\mathrm{ISO}_{G_n} \subseteq U_n$ represent the set of isolated processors, i.e. processors which
have no incident edges, in $G_n(U_n, E_n)$. Clearly,

$$|\mathrm{ISO}_{G_n}| \ge n - 2|E_n| \to \infty.$$

For a syndrome, fault set pair $(S, F) \in \mathrm{Correct}_{G_n}(A)$ let

$$\mathrm{Relabel}_{(S,F)} = \{(S', F') : S' = S,\ F' \ne F, \text{ and } F - \mathrm{ISO}_{G_n} = F' - \mathrm{ISO}_{G_n}\}$$

and let

$$\mathrm{AllLabel}_{(S,F)} = \mathrm{Relabel}_{(S,F)} \cup \{(S, F)\}.$$

Thus, $\mathrm{Relabel}_{(S,F)}$ consists of the syndrome, fault set pairs in which the processors
of $\mathrm{ISO}_{G_n}$ are relabeled in all possible ways. Clearly,

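\[
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A))
\ge \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}(\mathrm{Relabel}_{(S,F)})
= \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} \left[ P'_{G_n}(\mathrm{AllLabel}_{(S,F)}) - P'_{G_n}((S,F)) \right]
\]

and since all processors in the set $\mathrm{ISO}_{G_n}$ are isolated,

\[
P'_{G_n}((S,F)) = p^{|\mathrm{ISO}_{G_n} \cap F|} (1-p)^{|\mathrm{ISO}_{G_n} \cap (U_n - F)|} \, P'_{G_n}(\mathrm{AllLabel}_{(S,F)}).
\]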


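Therefore,

\[
\sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}(\mathrm{AllLabel}_{(S,F)})
= \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} \frac{P'_{G_n}((S,F))}{p^{|\mathrm{ISO}_{G_n} \cap F|} (1-p)^{|\mathrm{ISO}_{G_n} \cap (U_n - F)|}}
\ge \frac{1}{[\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}} \sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}((S,F))
\]

and thus

\[
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A))
\ge \left( \frac{1}{[\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}} - 1 \right)
\sum_{(S,F) \in \mathrm{Correct}_{G_n}(A)} P'_{G_n}((S,F)).
\]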
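Therefore,

\[
P'_{G_n}(\mathrm{Correct}_{G_n}(A))
\le \frac{[\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}}{1 - [\max(p, 1-p)]^{|\mathrm{ISO}_{G_n}|}} \, P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A))
\to 0
\]

as $n \to \infty$. Thus, for any deterministic algorithm $A$, $\mathrm{DiagProb}_{G_n}(A) \to 0$, as well. Now, consider
any probabilistic diagnosis algorithm $A$. Then, since $P'_{G_n} \in \mathcal{P}_{G_n}$,

\[
\mathrm{DiagProb}_{G_n}(A) \le \sum_{(S,F) \in \Pi_{G_n}} P'_{G_n}((S,F)) \cdot p_{A,S}(F).
\]

Consider the deterministic algorithm $A'$ that for any syndrome $S$ chooses fault set
$F$ such that $\forall F' \subseteq U_n$

\[ P'_{G_n}((S,F)) \ge P'_{G_n}((S,F')). \]

Then, if $\mathcal{S}$ represents the set of all syndromes in $G_n$,

\begin{align*}
\mathrm{DiagProb}_{G_n}(A)
&\le \sum_{(S,F) \in \Pi_{G_n}} P'_{G_n}((S, \mathrm{Faulty}_{A'}(S))) \cdot p_{A,S}(F) \\
&= \sum_{S \in \mathcal{S}} \sum_{F \subseteq U_n} P'_{G_n}((S, \mathrm{Faulty}_{A'}(S))) \cdot p_{A,S}(F) \\
&= \sum_{S \in \mathcal{S}} P'_{G_n}((S, \mathrm{Faulty}_{A'}(S))) \sum_{F \subseteq U_n} p_{A,S}(F) \\
&= P'_{G_n}(\mathrm{Correct}_{G_n}(A')) \\
&\to 0.
\end{align*}

□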

A few comments concerning this result are in order. While the theorem implies 
only that under some behavior of the faulty processors, correct diagnosis with high 
probability is impossible to achieve, the result can be shown to hold for all “natural” 
faulty processor behaviors using virtually the same proof. The key to the proof lies in 
the fact that the isolated processors of the system can be relabeled in arbitrary ways 
without affecting the probability of any test outcomes in the system or the status 
of other processors. This holds as long as outcomes of tests performed by faulty 
processors do not depend on the status of these isolated processors. Thus, correct 
diagnosis with high probability cannot be achieved unless the faulty processors are, 
in some sense, clairvoyant. 
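
To make the rate of decay in the proof of Theorem 3 concrete, the following minimal sketch evaluates the final bound of that proof, $m^k / (1 - m^k)$ with $m = \max(p, 1-p)$, using the lower bound $k = n - 2|E_n|$ on the number of isolated processors. The function name and the choice of a square-root edge count in the usage lines are illustrative assumptions, not part of the original analysis.

```python
def correct_diagnosis_upper_bound(n: int, num_edges: int, p: float) -> float:
    """Upper bound on the probability of correct diagnosis from the proof of
    Theorem 3, under the 'faulty processors pass everything' behavior.

    k = n - 2*num_edges is a lower bound on the number of isolated processors;
    the bound is m**k / (1 - m**k) with m = max(p, 1 - p), capped at 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    k = n - 2 * num_edges
    if k <= 0:
        return 1.0  # no isolated processor is guaranteed; the bound is trivial
    mk = max(p, 1.0 - p) ** k
    return min(1.0, mk / (1.0 - mk))


if __name__ == "__main__":
    # Hypothetical sublinear edge count: |E_n| = floor(sqrt(n)).
    for n in (100, 1_000, 10_000):
        edges = int(n ** 0.5)
        print(n, edges, correct_diagnosis_upper_bound(n, edges, 0.01))
```

Even for a small failure probability such as 0.01, the bound collapses quickly once the number of isolated processors grows, which is the quantitative content of the theorem.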


7 Diagnosis in Regular Systems 

The study of regular systems is important for several reasons. First, regular designs 
are more easily and efficiently implementable than irregular designs. Furthermore, 
the majority of existing multiprocessor systems possess a regular structure. Finally,
assuming the tests are conducted in a set of rounds, the maximum number of tests 
conducted by any processor is a measure of the overhead required to achieve fault 
tolerance. For a fixed total number of tests, regular systems require the minimum 
overhead using this measure. In this section, we examine the diagnosis problem for 
regular systems under our probabilistic model. 

7.1 Upper Bound 

In [18], it was shown that correct diagnosis can be achieved with probability
approaching one in a class of systems, known as $D_{1, c \log n}$ systems, for $c$ exceeding
a certain threshold. The systems from this class conduct $cn \log n$ tests. In this
section, we present a class of systems conducting $cn \log n$ tests which contains
the class given in [18] and for which Algorithm Majority achieves correct diagnosis
with probability approaching one. This class of systems contains many useful
systems, e.g. hypercubes, which are not contained in the $D_{1, c \log n}$ class.

The systems studied in this section are those for which every processor in the
system is tested by at least $c \log_2 n$ other processors, for $c$ sufficiently large. This
includes regular systems with $O(n \log_2 n)$ tests, of sufficiently large degree.
Theorem 4 shows that for any sequence of these systems, Algorithm Majority will produce
correct diagnosis with probability approaching one. In order to prove this and
subsequent results, the following corollary [1,8] to a theorem proved by Chernoff [5] is
needed.

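Corollary 1 Let $Y$ be a binomial random variable with parameters $n$ and $p$. Then

\[
P(Y \le anp) = \sum_{i=0}^{\lfloor anp \rfloor} \binom{n}{i} p^i (1-p)^{n-i} \le e^{-(1-a)^2 np/2}, \qquad 0 < a < 1
\]
\[
P(Y \ge anp) = \sum_{i=\lceil anp \rceil}^{n} \binom{n}{i} p^i (1-p)^{n-i} \le \left( \frac{1}{a} \right)^{anp} \left( \frac{1-p}{1-ap} \right)^{n-anp}, \qquad 1 < a < 1/p
\]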
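
As a sanity check on bounds of this kind, the sketch below compares exact binomial tails with the two Chernoff-type bounds. It assumes the bounds take the forms $e^{-(1-a)^2 np/2}$ for the lower tail and $(1/a)^{anp} ((1-p)/(1-ap))^{n-anp}$ for the upper tail, which are the forms consistent with how the corollary is used in the proofs of Theorems 4 and 5. The functions are parameterized by the integer threshold $t = anp$ to avoid floating-point rounding at the summation limit; all names are illustrative.

```python
import math


def binom_pmf(n: int, p: float, i: int) -> float:
    """P(Y = i) for Y ~ Binomial(n, p)."""
    return math.comb(n, i) * p**i * (1.0 - p) ** (n - i)


def lower_tail_exact(n: int, p: float, t: int) -> float:
    """Exact P(Y <= t)."""
    return sum(binom_pmf(n, p, i) for i in range(t + 1))


def upper_tail_exact(n: int, p: float, t: int) -> float:
    """Exact P(Y >= t)."""
    return sum(binom_pmf(n, p, i) for i in range(t, n + 1))


def chernoff_lower(n: int, p: float, t: int) -> float:
    """exp(-(1-a)^2 np / 2) with a = t/(np); meaningful for 0 < t < np."""
    a = t / (n * p)
    return math.exp(-((1.0 - a) ** 2) * n * p / 2.0)


def chernoff_upper(n: int, p: float, t: int) -> float:
    """(1/a)^(anp) * ((1-p)/(1-ap))^(n-anp) with a = t/(np); for np < t < n."""
    return (n * p / t) ** t * ((n - n * p) / (n - t)) ** (n - t)


if __name__ == "__main__":
    # Upper tail: 20 testers, each faulty with probability 0.1, at least half faulty.
    print(upper_tail_exact(20, 0.1, 10), chernoff_upper(20, 0.1, 10))
    # Lower tail: 20 testers, each fault-free with probability 0.9, at most half fault-free.
    print(lower_tail_exact(20, 0.9, 10), chernoff_lower(20, 0.9, 10))
```

With $t = n/2$ the upper-tail bound reduces to $(4p(1-p))^{n/2}$, which is the quantity that drives the proof of Theorem 4.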


Theorem 4 Let $c$ be any constant such that
$c > \left\{ \frac{\log_2 e}{2} \left[ \log_e \frac{1}{2(1-p)} + \log_e \frac{1}{2p} \right] \right\}^{-1}$.
If $p < 1/2$, then for any sequence of digraphs on $n$ vertices having a tester set of size
at least $c \log_2 n$ for every processor, the probability of correct diagnosis for Algorithm
Majority approaches one as $n \to \infty$.

Proof: We need to show that for any sequence satisfying the theorem condition,
$\mathrm{DiagProb}_{G_n}(\mathrm{Majority}) \to 1$ as $n \to \infty$. Intuitively, the worst performance of
Algorithm Majority is obtained when faulty processors fail all fault-free processors
and pass all faulty processors. Let $P'_{G_n} \in \mathcal{P}_{G_n}$ be the refinement corresponding to
this faulty processor behavior. Consider $(S,F) \in \mathrm{Correct}_{G_n}(\mathrm{Majority})$ such that
$P'_{G_n}((S,F)) > 0$. Let $B$ be the basic event with $(S,F) \in B$. Then, $\forall (S',F') \in B$,
$(S',F') \in \mathrm{Correct}_{G_n}(\mathrm{Majority})$. Thus, $\forall P_{G_n} \in \mathcal{P}_{G_n}$,

\[ P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \le P_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \]

and therefore

\[ \mathrm{DiagProb}_{G_n}(\mathrm{Majority}) = P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})). \]

Now, let X be a random variable representing the number of units whose tester 
set does not contain a majority of fault-free units. Clearly, 

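\[ \{ (S,F) : X = 0 \text{ and } (S,F) \text{ is compatible} \} \subseteq \mathrm{Correct}_{G_n}(\mathrm{Majority}). \]

Therefore,

\begin{align*}
P'_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority}))
&\ge P'_{G_n}(\{ (S,F) : X = 0 \}) \\
&\ge 1 - P'_{G_n}(\{ (S,F) : X > 0 \}) \\
&\ge 1 - E[X].
\end{align*}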


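Now, $X = \sum_{i=1}^{n} X_i$ where

\[
X_i =
\begin{cases}
1 & \text{if } u_i \text{ is tested by at least } |\Gamma^{-1}(u_i)|/2 \text{ faulty units} \\
0 & \text{otherwise}
\end{cases}
\]

and

\begin{align*}
E[X] &= \sum_{i=1}^{n} E[X_i] \\
&= \sum_{i=1}^{n} P'_{G_n}(\{ (S,F) : X_i = 1 \}) \\
&= \sum_{i=1}^{n} \sum_{j = \lceil |\Gamma^{-1}(u_i)|/2 \rceil}^{|\Gamma^{-1}(u_i)|} \binom{|\Gamma^{-1}(u_i)|}{j} p^j (1-p)^{|\Gamma^{-1}(u_i)| - j}.
\end{align*}

Now, $\lceil |\Gamma^{-1}(u_i)|/2 \rceil = \lceil a p |\Gamma^{-1}(u_i)| \rceil$, where $a = \frac{1}{2p} > 1$, and thus, by Corollary 1,

\[
E[X] \le \sum_{i=1}^{n} e^{\frac{|\Gamma^{-1}(u_i)|}{2} \left[ \log_e 2(1-p) + \log_e 2p \right]}.
\]

Since $p < 1/2$, $\log_e 2(1-p) + \log_e 2p < 0$ and so

\[
E[X] \le n \, e^{-\frac{c \log_2 n}{2} \left[ \log_e \frac{1}{2(1-p)} + \log_e \frac{1}{2p} \right]}
= n^{1 - \frac{c \log_2 e}{2} \left[ \log_e \frac{1}{2(1-p)} + \log_e \frac{1}{2p} \right]}
\to 0
\]

as $n \to \infty$, since
$c > \left\{ \frac{\log_2 e}{2} \left[ \log_e \frac{1}{2(1-p)} + \log_e \frac{1}{2p} \right] \right\}^{-1}$.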
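
The condition on $c$ in Theorem 4 can be evaluated numerically. The sketch below assumes the threshold has the form $\{ \frac{\log_2 e}{2} [ \log_e \frac{1}{2(1-p)} + \log_e \frac{1}{2p} ] \}^{-1}$, which simplifies to $2 / \log_2 \frac{1}{4p(1-p)}$, and that the resulting bound on $E[X]$ is $n^{1 - c/c^*}$ where $c^*$ is that threshold. Function names are illustrative.

```python
import math


def min_constant(p: float) -> float:
    """Threshold c* above which the Theorem 4 bound on E[X] tends to zero.

    c* = 2 / log2(1 / (4 p (1 - p))), valid for 0 < p < 1/2.
    """
    if not 0.0 < p < 0.5:
        raise ValueError("p must satisfy 0 < p < 1/2")
    return 2.0 / math.log2(1.0 / (4.0 * p * (1.0 - p)))


def expected_bad_units_bound(n: int, c: float, p: float) -> float:
    """Bound n**(1 - c/c*) on the expected number of units whose tester set
    lacks a fault-free majority, when every unit has c*log2(n) testers."""
    return n ** (1.0 - c / min_constant(p))


if __name__ == "__main__":
    for p in (0.002, 0.02, 0.067, 0.2):
        print(p, min_constant(p))
    # 10-dimensional hypercube (c = 1) with p = 0.02.
    print(expected_bad_units_bound(1024, 1.0, 0.02))
```

Since $1 - E[X]$ lower-bounds the probability of correct diagnosis, a small value of the second function translates directly into a diagnosis probability close to one.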
7.2 A Special Case - Hypercube Systems


In this section, we examine the consequences of Theorem 4 for hypercube systems. 
In a binary hypercube architecture, the constant c is equal to one. Hence, in order for 
the hypercube to be diagnosable with probability approaching one the probability 
of failure p must satisfy 


\[
\left\{ \frac{\log_2 e}{2} \left[ \log_e \frac{1}{2(1-p)} + \log_e \frac{1}{2p} \right] \right\}^{-1} < 1.
\]


This implies that p must be less than approximately 0.067. This condition is likely 
to be satisfied in the majority of fault environments. The probability of failure can 
be higher in many of the other members of the hypercube family which have c > 1. 
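
The threshold of approximately 0.067 quoted above can be recovered by hand. The following worked derivation is a sketch that assumes the hypercube condition is the Theorem 4 bound with $c = 1$, i.e. $\frac{\log_2 e}{2} \left[ \log_e \frac{1}{2(1-p)} + \log_e \frac{1}{2p} \right] > 1$, restricted to $p < 1/2$.

```latex
\begin{align*}
\frac{\log_2 e}{2} \, \ln \frac{1}{4p(1-p)} > 1
  &\iff \ln \frac{1}{4p(1-p)} > 2 \ln 2 \\
  &\iff 4p(1-p) < \tfrac{1}{4} \\
  &\iff p^2 - p + \tfrac{1}{16} > 0 \\
  &\iff p < \frac{1 - \sqrt{3}/2}{2} \approx 0.0670
        \qquad (\text{taking the root below } 1/2).
\end{align*}
```

The first step uses $\log_2 e = 1 / \ln 2$, and the last step takes the smaller root of the quadratic $p^2 - p + 1/16 = 0$.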

Most of the previous work in the diagnosis area has utilized a bounded-size fault
set model where it is assumed that no more than $t$ faults occur in the system. A
system is said to be $t$-diagnosable if any combination of $t$ faulty units in the system
can be uniquely diagnosed. It is well known that a $k$-dimensional hypercube is
$k$-diagnosable but not $(k+1)$-diagnosable. Since $k = \log_2 n$, where $n$ is the number
of vertices of the cube, the assumptions of the bounded-size fault set model are
satisfied only when the number of faults is less than or equal to the logarithm of
the number of units. It is unlikely that this condition will be met in large systems.
Under the probabilistic model, however, a number of faults that is linear in the
number of units can be tolerated.
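
The contrast between the two models can be explored with a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions, not the paper's algorithm as specified: it assumes Algorithm Majority labels a unit faulty exactly when more than half of its testers fail it (ties resolve to fault-free), that every hypercube node is tested by its $k$ neighbors, and that faulty testers behave adversarially by failing fault-free units and passing faulty ones. All function names are hypothetical.

```python
import random


def majority_diagnosis_correct(k: int, p: float, rng: random.Random) -> bool:
    """Run one trial on a k-dimensional hypercube; return True if the assumed
    majority rule labels every node correctly."""
    n = 1 << k
    faulty = [rng.random() < p for _ in range(n)]
    for u in range(n):
        fails = 0
        for bit in range(k):
            tester = u ^ (1 << bit)  # hypercube neighbor of u
            if faulty[tester]:
                # Adversarial faulty tester: reports the opposite of the truth.
                fails += 0 if faulty[u] else 1
            else:
                # Fault-free tester: reports the truth.
                fails += 1 if faulty[u] else 0
        diagnosed_faulty = 2 * fails > k
        if diagnosed_faulty != faulty[u]:
            return False
    return True


def estimate_success_rate(k: int, p: float, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which every node is diagnosed correctly."""
    rng = random.Random(seed)
    hits = sum(majority_diagnosis_correct(k, p, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # n = 256 nodes, about 5 expected faults per trial.
    print(estimate_success_rate(8, 0.02, 200))
```

In this setting the expected number of faults per trial is a fixed fraction of the 256 nodes, yet a trial fails only when some node has at least half of its neighbors faulty.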

Table 3 illustrates the diagnosis performance difference between the two models 
on hypercube systems for probabilities of failure of 0.002 and 0.020. The fourth 
column of this table lists the expected number of faulty processors for the
corresponding system and failure probability. $P_k$ represents the probability that no more

  k          n       p    Exp. # faulty       P_k     P_Maj
  6         64   0.002             0.13    1.0000    1.0000
  6         64   0.020             1.28    0.9997    0.9999
  8        256   0.002             0.51    1.0000    1.0000
  8        256   0.020             5.12    0.9258    1.0000
 10       1024   0.002             2.05    1.0000    1.0000
 10       1024   0.020            20.48    0.0079    1.0000
 12       4096   0.002             8.19    0.9267    1.0000
 12       4096   0.020            81.92    0.0000    1.0000
 14      16384   0.002            32.77    0.0002    1.0000
 14      16384   0.020           327.68    0.0000    1.0000
 16      65536   0.002           131.07    0.0000    1.0000
 16      65536   0.020          1310.72    0.0000    1.0000
 20    1048576   0.002          2097.15    0.0000    1.0000
 20    1048576   0.020         20971.52    0.0000    1.0000

Table 3: Diagnosis Probability on a k-dimensional, n-node Hypercube 


than $k$ units are faulty and $P_{Maj}$ represents a lower bound on the probability of
correct diagnosis for Algorithm Majority. Since the diagnosis algorithms proposed
for the bounded-size fault set model can only guarantee correct diagnosis in a
$k$-dimensional hypercube when the number of faults is less than or equal to $k$, $P_k$ is
an estimate for the probability of correct diagnosis for those algorithms.

It can be seen from Table 3 that performance under the bounded-size fault set 
model degrades rapidly as the size of the hypercube increases. The probability of
correct diagnosis for Algorithm Majority, however, is very nearly one for all the 
hypercubes studied, even when the probability of failure of a processor is as large 
as 0.02. Consider the case where k = 16 and the probability of failure is 0.02. In 
this situation, the expected number of faults is greater than 1300 and yet Algorithm 
Majority still produces correct diagnosis with a probability that is very nearly one. 
Under the bounded-size fault set model, the number of faults is limited to 16 for 
this situation. When k = 16, the number of processors is 65,536. While this 
may seem large, a system containing this many processors, namely the Connection 
Machine [11], has been built. 
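
The bounded-size fault set columns of Table 3 can be recomputed directly. The sketch below assumes that each of the $n = 2^k$ processors fails independently with probability $p$, so that the number of faults is binomial, the expected number of faults is $np$, and $P_k$ is the binomial probability of at most $k$ faults. Function names are illustrative.

```python
import math


def expected_faulty(k: int, p: float) -> float:
    """Expected number of faulty processors in a k-dimensional hypercube."""
    return (1 << k) * p


def prob_at_most_k_faulty(k: int, p: float) -> float:
    """P(number of faults <= k) for n = 2**k independent processors."""
    n = 1 << k
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


if __name__ == "__main__":
    print(" k        n      p   exp. faulty     P_k")
    for k in (6, 8, 10, 12, 14, 16, 20):
        for p in (0.002, 0.020):
            print(f"{k:2d} {1 << k:8d}  {p:.3f}  {expected_faulty(k, p):12.2f}  "
                  f"{prob_at_most_k_faulty(k, p):.4f}")
```

Because the expected number of faults grows linearly in $n$ while the tolerated number grows only as $\log_2 n$, the computed $P_k$ falls toward zero as $k$ increases.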
7.3 Lower Bound

While hypercubes are an important class of system, systems with even fewer con- 
nections are expected to see increased use in future multiprocessor applications. We 
are therefore interested in determining a lower bound on the total number of tests 
necessary to achieve correct diagnosis with high probability. Such a lower bound 
was proven in [2] for regular systems. This result states that all diagnosis algorithms 
must have a probability of correct diagnosis that approaches zero in regular systems 
with $o(n \log n)$ tests. The probability model used in [2] is more general and contains
the model utilized in this paper as a special case; hence this result holds for this model as
well. Thus, for the important class of regular systems the algorithm given in [18]
as well as Algorithm Majority are both optimal to within a constant factor. This 
result also demonstrates that the irregular structure of the tester digraphs studied 
in this paper is a crucial factor in making them amenable to diagnosis. 

Of special interest due to their widespread use are multiprocessor systems which 
are regular and of fixed degree. Included in this class of systems are rings, tori, and
hexagonal meshes. This somewhat pessimistic result implies that weaker forms of 
diagnosis must be considered for these systems. 

8 Diagnosis using a Linear Number of Tests 

It has been shown that Algorithm Majority can achieve correct diagnosis with probability
approaching one in digraphs containing $n\omega(n)$ edges, while all algorithms
must have probability approaching zero of correct diagnosis in digraphs possessing 
o(n) edges. These results leave open the question of what can be achieved using cn 
edges, for some positive constant c. In this section, it is shown that with cn edges 
Algorithm Majority can achieve a probability of correct diagnosis that is a constant 
arbitrarily close to one. It is also shown that a constant probability less than one 
is the best that any algorithm can hope to achieve in this situation, meaning that 
Algorithm Majority is optimal for digraphs with a linear number of edges. 

The following theorem characterizes the performance of Algorithm Majority on 
digraphs with a linear number of edges. 

Theorem 5 Let $\epsilon$ be any real number such that $0 < \epsilon < 1$. If $p < 1/2$, $\exists c > 0$ such
that for all sufficiently large tester digraphs having at least $c$ testers, the probability
of correct diagnosis for Algorithm Majority is at least $1 - \epsilon$.

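Proof: We must show that, for any $\epsilon$ with $0 < \epsilon < 1$, $\exists c > 0, n_0$ such that
if $G_n(U_n, E_n)$ is a sequence of tester digraphs with $|T_{G_n}| \ge c$, then $\forall n \ge n_0$,
$\mathrm{DiagProb}_{G_n}(\mathrm{Majority}) \ge 1 - \epsilon$. Let $a = \frac{1}{2(1-p)} < 1$. Then, $\forall P_{G_n} \in \mathcal{P}_{G_n}$,

\[
P_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority}))
\ge 1 - \sum_{i=0}^{\lfloor |T_{G_n}|/2 \rfloor} \binom{|T_{G_n}|}{i} (1-p)^i p^{|T_{G_n}| - i}
\ge 1 - e^{-(1-a)^2 |T_{G_n}| (1-p)/2}
\]

by Corollary 1. Now, if $c$ is chosen such that

\[ c = \frac{-2 \ln \epsilon}{(1-a)^2 (1-p)} \]

then

\[
P_{G_n}(\mathrm{Correct}_{G_n}(\mathrm{Majority})) \ge 1 - e^{\ln \epsilon} = 1 - \epsilon.
\]

□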

Thus, Algorithm Majority can achieve correct diagnosis with probability arbi- 
trarily close to one in sequences of digraphs having a linear number of edges. The 
following theorem shows that all diagnosis algorithms must have a probability of 
correct diagnosis that is bounded away from one by a positive constant in this 
situation. 

Theorem 6 Let $c$ be any positive constant. If $0 < p < 1$, $\exists \epsilon > 0$ such that for any
probabilistic or deterministic diagnosis algorithm $A$ and any sufficiently large digraph
on $n$ vertices having no more than $cn$ edges, the probability of correct diagnosis for
Algorithm $A$ is no greater than $1 - \epsilon$.

Proof: We must show that, for any $c > 0$, $\exists \epsilon > 0, n_0$ such that if $G_n(U_n, E_n)$ is
a sequence of digraphs with $|E_n| \le cn$, then $\forall n \ge n_0$, $\mathrm{DiagProb}_{G_n}(A) \le 1 - \epsilon$. Let
$P'_{G_n} \in \mathcal{P}_{G_n}$ be such that faulty processors fail all other processors. Now, let
$u_{\min_{G_n}} \in U_n$ be any vertex of $G_n$ such that $\forall u \in U_n$, $|N(u_{\min_{G_n}})| \le |N(u)|$. Thus, $u_{\min_{G_n}}$
is a processor having minimum size neighbor set in $G_n$. Clearly, $|N(u_{\min_{G_n}})| \le 2c$.
Now, let

\[ \mathrm{Surr}_{G_n} = \{ (S,F) : N(u_{\min_{G_n}}) \cap (U_n - F) = \emptyset \}. \]

Thus, $\mathrm{Surr}_{G_n}$ represents the set of syndrome, fault set pairs in which $u_{\min_{G_n}}$ has
only faulty processors in its neighbor set. Then,

\[ P'_{G_n}(\mathrm{Surr}_{G_n}) = p^{|N(u_{\min_{G_n}})|}. \]
Now, if $\mathrm{Surr}_{G_n}^{c}$ represents the complement of $\mathrm{Surr}_{G_n}$, then for any deterministic
diagnosis algorithm $A$

\begin{align*}
P'_{G_n}(\mathrm{Correct}_{G_n}(A) \cap \mathrm{Surr}_{G_n})
&= 1 - P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A) \cup \mathrm{Surr}_{G_n}^{c}) \\
&\ge 1 - P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) - P'_{G_n}(\mathrm{Surr}_{G_n}^{c}) \\
&\ge p^{2c} - P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)).
\end{align*}

Now, consider any $(S,F) \in \mathrm{Correct}_{G_n}(A) \cap \mathrm{Surr}_{G_n}$ and $(S,F')$ such that if
$u_{\min_{G_n}} \in F$, then $F' = F - \{u_{\min_{G_n}}\}$, otherwise $F' = F \cup \{u_{\min_{G_n}}\}$. Thus, $(S,F')$
is identical to $(S,F)$ except for the label on $u_{\min_{G_n}}$. Since faulty processors fail all
other processors, each edge incident on $u_{\min_{G_n}}$ will be a one regardless of the state
of $u_{\min_{G_n}}$. Thus,

\[
P'_{G_n}((S,F')) \ge \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) P'_{G_n}((S,F))
\]


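and therefore,

\begin{align*}
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A))
&\ge \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) P'_{G_n}(\mathrm{Correct}_{G_n}(A) \cap \mathrm{Surr}_{G_n}) \\
&\ge \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) \left[ p^{2c} - P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \right]
\end{align*}

or

\[
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A)) \left[ 1 + \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) \right]
\ge \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) p^{2c}
\]

and

\[
P'_{G_n}(\mathrm{NotCorrect}_{G_n}(A))
\ge \frac{\min\left( \frac{p}{1-p}, \frac{1-p}{p} \right) p^{2c}}{1 + \min\left( \frac{p}{1-p}, \frac{1-p}{p} \right)}
= \epsilon > 0
\]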
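so long as $0 < p < 1$. Now, consider any probabilistic diagnosis algorithm $A$. Then,
since $P'_{G_n} \in \mathcal{P}_{G_n}$,

\[
\mathrm{DiagProb}_{G_n}(A) \le \sum_{(S,F) \in \Pi_{G_n}} P'_{G_n}((S,F)) \cdot p_{A,S}(F).
\]

Consider the deterministic algorithm $A'$ that for any syndrome $S$ chooses fault set
$F$ such that $\forall F' \subseteq U_n$

\[ P'_{G_n}((S,F)) \ge P'_{G_n}((S,F')). \]

Then, if $\mathcal{S}$ represents the set of all syndromes in $G_n$,

\begin{align*}
\mathrm{DiagProb}_{G_n}(A)
&\le \sum_{(S,F) \in \Pi_{G_n}} P'_{G_n}((S, \mathrm{Faulty}_{A'}(S))) \cdot p_{A,S}(F) \\
&= \sum_{S \in \mathcal{S}} \sum_{F \subseteq U_n} P'_{G_n}((S, \mathrm{Faulty}_{A'}(S))) \cdot p_{A,S}(F) \\
&= \sum_{S \in \mathcal{S}} P'_{G_n}((S, \mathrm{Faulty}_{A'}(S))) \sum_{F \subseteq U_n} p_{A,S}(F) \\
&= P'_{G_n}(\mathrm{Correct}_{G_n}(A')) \\
&\le 1 - \epsilon.
\end{align*}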
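
The error floor obtained in the proof of Theorem 6 can be evaluated for concrete parameters. The sketch below assumes the floor has the form $\epsilon = \frac{r}{1+r} p^{2c}$ with $r = \min(\frac{p}{1-p}, \frac{1-p}{p})$; a short calculation shows that $r/(1+r) = \min(p, 1-p)$, which the second function uses as a cross-check. Function names are illustrative.

```python
def error_floor(c: float, p: float) -> float:
    """Lower bound eps on the probability of incorrect diagnosis for digraphs
    with at most c*n edges: eps = r/(1+r) * p**(2c), r = min(p/(1-p), (1-p)/p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    r = min(p / (1.0 - p), (1.0 - p) / p)
    return r / (1.0 + r) * p ** (2.0 * c)


def error_floor_simplified(c: float, p: float) -> float:
    """Equivalent closed form: min(p, 1-p) * p**(2c)."""
    return min(p, 1.0 - p) * p ** (2.0 * c)


if __name__ == "__main__":
    for c in (1, 2, 4):
        print(c, error_floor(c, 0.02), error_floor_simplified(c, 0.02))
```

The floor shrinks geometrically in $c$, which is consistent with Theorem 5: adding a constant factor more tests pushes the achievable diagnosis probability closer to one, but never reaches it.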
9 Conclusion

A probabilistic fault model for multiprocessor systems in which processors are faulty 
with probability p has been studied. It has been shown that correct diagnosis can 
be achieved with probability approaching one in a class of systems that conducts 
slightly more than a linear number of tests using a simple and efficient diagnosis 
algorithm. This algorithm also produces a probability of correct diagnosis that 
is arbitrarily close to one in systems conducting a linear number of tests. It has 
also been shown that this result is the best possible, i.e. in systems for which the 
number of tests grows more slowly than n, all diagnosis algorithms, whether they 
be deterministic or probabilistic in nature, must have a probability approaching 
zero of correct diagnosis and furthermore, in systems containing a linear number of 
tests, all algorithms have a probability of correct diagnosis bounded above by some 
constant less than one. In addition, this algorithm has been shown to work with 
high probability on a class of regular systems which contains hypercubes as a special 
case. This result is nearly the best possible as it is known that no algorithm can 
achieve diagnosis with high probability on regular systems of degree $o(\log n)$.


References 

[1] Angluin, D., and L. Valiant, “Fast Probabilistic Algorithms for Hamiltonian 
Circuits and Matchings,” Journal of Computer and System Sciences, vol. 18, 
pp. 155-193, April 1979. 

[2] Blough, D. M., Fault Detection and Diagnosis in Multiprocessor Systems, Ph. D.
Dissertation, The Johns Hopkins University, 1988. 





[3] Blough, D. M., G. F. Sullivan, and G. M. Masson, “Almost Certain Diagnosis
for Intermittently Faulty Systems,” Digest of the 18th International Symposium
on Fault-Tolerant Computing, IEEE Computer Society Press, pp. 260-265,
1988.

[4] Blount, M., “Probabilistic Treatment of Diagnosis in Digital Systems,” Digest 
of the 7th International Symposium on Fault-Tolerant Computing, IEEE Com- 
puter Society Press, pp. 72-77, 1977. 

[5] Chernoff, H., “A Measure of Asymptotic Efficiency for Tests of a Hypothesis 
Based on the Sum of Observations,” Annals of Mathematical Statistics, vol. 23, 
pp. 493-507, 1952. 

[6] Dahbura, A. T., “An Efficient Algorithm for Identifying the Most Likely Fault 
Set in a Probabilistically Diagnosable System,” IEEE Transactions on Com- 
puters, vol. C-35, pp. 354-356, April 1986. 

[7] Dahbura, A. T., K. Sabnani, and L. King, “The Comparison Approach to 
Multiprocessor Fault Diagnosis,” IEEE Transactions on Computers, vol. C-36, 
pp. 373-378, March 1987. 

[8] Erdős, P., and J. Spencer, Probabilistic Methods in Combinatorics, New York:
Academic Press, 1974. 

[9] Feller, W., An Introduction to Probability Theory and its Applications, New 
York: John Wiley & Sons, Inc., 1968.

[10] Hakimi, S. L., and A. T. Amin, “Characterization of the Connection Assignment
of Diagnosable Systems,” IEEE Transactions on Computers, vol. C-23,
pp. 86-88, January 1974.

[11] Hillis, W. Daniel, The Connection Machine, Cambridge: The MIT Press, 1985. 

[12] Kuhl, J., and S. Reddy, “Distributed Fault-Tolerance for Large Multiproces- 
sor Systems,” Proceedings 7th Annual Symposium on Computer Architecture, 
pp. 23-30, May 1980. 

[13] Maheshwari, S. N., and S. L. Hakimi, “On Models for Diagnosable Systems 
and Probabilistic Fault Diagnosis,” IEEE Transactions on Computers, vol. C- 
25, pp. 228-236, March 1976. 

[14] Mallela, S., and G. M. Masson, “Diagnosis Without Repair for Hybrid Fault 
Situations,” IEEE Transactions on Computers, vol. C-29, pp. 461-470, June 
1980. 




[15] Pelc, A., “Undirected Graph Models for System-Level Fault Diagnosis,” 
Departement d’Informatique Research Report #10, Universite du Quebec a 
Hull, 1988. 

[16] Preparata, F., G. Metze, and R. Chien, “On the Connection Assignment Prob- 
lem of Diagnosable Systems,” IEEE Transactions on Electronic Computers, vol. 
EC-16, pp. 848-854, December 1967. 

[17] Rangarajan, S., and D. Fussell, “A Probabilistic Method for Fault Diagnosis 
of Multiprocessor Systems,” Digest of the 18th International Fault-Tolerant
Computing Symposium, pp. 278-283, IEEE Computer Society Press, June 1988. 

[18] Scheinerman, E., “Almost Sure Fault Tolerance in Random Graphs,” SIAM 
Journal on Computing, vol. 16, pp. 1124-1134, December 1987. 

[19] Somani, A., V. Agarwal, and D. Avis, “A Generalized Theory for System-Level 
Diagnosis,” IEEE Transactions on Computers, vol. C-36, pp. 538-546, May 
1987. 

[20] Sullivan, G. F., “System-Level Fault Diagnosability in Probabilistic and
Weighted Models,” Digest of the 17th International Symposium on Fault- 
Tolerant Computing, IEEE Computer Society Press, pp. 190-195, 1987. 

[21] Yang, C.-L., and G. M. Masson, “A Fault Identification Algorithm for
$t_i$-diagnosable Systems,” IEEE Transactions on Computers, vol. C-35,
pp. 503-510, June 1986.


